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ABSTRACT 

An approximation to the likelihood for the 
generalized linear models with random coefficients is derived and is 
the basis for an approximate Fisher scoring algorithm. The method is 
illustrated on the logistic regression model for one-way 
classification, but it has an extension to the class of generalized 
linear models and to more complex data structures, such as nested 
two-way classification. Both full and restricted maximum likelihood 
versions of this algorithm are defined. The estimators of the 
regression parameters coincide with the generalized estimating 
equations of S. L. Zeger and K. Y. Liang (1986) but an essentially 
different class of estimators for the covariance structure parameters 
is obtained. A simulation study explores the properties of the 
proposed estimators. Five tables, 12 figures, and an appendix of 
statistical analysis are included. (Contains 43 references.) 
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LOGISTIC REGRESSION WITH 
RANDOM COEFFICIENTS 

N. T. Longford 
Educational Testing Service, Princeton, NJ 

Abstract 

An approximation to the likelihood for the generalized linear models with 
random coefficients is derived and is the basis for an approximate Fisher 
scoring algorithm. The method is illustrated on the logistic regression model 
for one-way classification, but it has an extension to the class of generalized 
linear models and to more complex data structures, such as nested two- 
way classification. Both full and restricted maximum likelihood versions 
of this algorithm are defined. The estimators of the regression parameters 
coincide with the generalized estimating equations of Zeger and Liang (1986) 
but an essentially different class of estimators for the covariance structure 
parameters is obtained. A simulation study explores the properties of the 
proposed estimators. 

Some key words: covariance structure, Fisher scoring method, logistic 
regression, maximum likelihood, random coefficients. 
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1 Introduction 

Clustered observations arise in a wide variety of applications, including agri- 
cultural and animal breeding studies, econometrics, educational and medical 
research, and survey analysis. For regression of such data random coefficient 
models are usually considered, so as to take account of, or to make inference 
about, the between-cluster variation. 

There are several well-established and well-researched computational al- 
gorithms for fitting random coefficient models with the usual normal assump- 
tions; for a comprehensive review of earlier developments, see Harville (1977). 
In this paper we concentrate on a logistic regression model for correlated bi- 
nary outcomes, although the methods discussed have direct extensions for 
binomial data with link functions other than logistic, and more generally, to 
the entire class of generalized linear models. 

Our development is similar to the generalized estimating equations (GEE) 
of Liang and Zeger (1986) and Zeger and Liang (1986), although we estab- 
lish a more direct connection between the generalized linear models with 
random coefficients and the corresponding computational algorithms. For 
example, our computational algorithm is based on an approximation to the 
log-likelihood, and estimation procedures have their full and restricted (ap- 
proximate) maximum likelihood versions. We propose a new estimator for 
the covariance structure parameters which is a generalization of the maxi- 
mum likelihood estimator in the normal case. 
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Section 2 describes random coefficient models, and gives a brief summary 
of computational algorithms for fitting such models for binary data. Sec- 
tion 3 discusses a maximum likelihood procedure using Gaussian quadrature, 
and Section 4 presents a procedure based on an approximation to the log- 
likelihood. Extensions of this method for more complex data structures, and 
for other generalized linear models are indicated in Section 5. Formal deriva- 
tion of the approximation to the log-likelihood for correlated observations 
with a distribution from the exponential family is given in the Appendix. 
Section 6 describes an adaptation of the exact and approximate procedures 
for restricted maximum likelihood. In Section 7 we study the bias of the gen- 
eralized least squares estimator for the regression parameters. We establish 
a relationship between the bias of the ordinary least squares estimator in the 
normal model with random coefficients and our approximation to the bias in 
the logistic regression. 

The methods are illustrated and compared on two examples (Section 
8), and the properties of the associated estimators are explored in a sim- 
ulation study (Section 9). We conclude that the generalized least squares 
method provides a satisfactory estimator except when there is extremely 
large between-cluster variation. 

We focus on logistic regression and binary data, since they are frequently 
considered in practice, and binary data represent an extreme form of depar- 
ture from normality. Random coefficient methods for clustered observations 
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provide a parsimonious summary for between-cluster differences. 

2 Logistic regression models 

We consider the logistic regression model 

logit{P(y;i = 1 1 S s )} = xyjS + *Sj (1) 

for binary outcomes {y t j} with a one-way layout structure, that is, i = 
l,...,7ij and j = l,...iV 2 , and explanatory variables x. The number of 
elementary-level units is N = n] + . . . + n^. The regression parameters /3 
and the variance parameter cr 2 may be known or unknown, and cr 2 > 0. Each 
regressor a; may be defined either for individuals (elementary observations) 
or for the clusters; in the latter case the index i is redundant, since such 
a variable is constant within clusters. We assume that {Sj} are a random 
sample from the standard normal distribution, Sj ~ N(0, 1), i.i.d. The model 
(1) has a straightforward extension to models with more complex patterns 
of between-cluster variation: 

logitJP^. = 1 1 Sj)} = XtfjS + aa^i, (2) 

where £ is a non-negative definite variance matrix (the possible non-unique- 
ness of the square root in (2) is immaterial), and Sj ~ N r (0, 1). The variables 



z are usually a subset of the variables x, and both contain the intercept 1 = 
(1, - . - , 1) T . This conforms with similar conventions in analysis of covariance. 

Extensions of the model (2) for data with a multi-way layout are straight- 
forward. For example, for the nested two-way layout (clusters within areas) 
it is natural to consider the model 

bgit{P(yy* = l|*j*a»*M)} = x »i*0 + ^iM^i^iU + StfM-Ww ( 3 ) 

where < = !,..., n$,j = 1, . . - , and k = 1, - . - , N 3 are the indices for the 
elementary observation within a cluster, for the cluster within an area, and 
for the area, respectively, and the components of the random vectors 6jk,i 
and 6k,2 are (univariate) independent N(0,1) random variables. 

Further generalization of the models (1) - (3) involves spanning them 
over the class of generalized linear models (Nelder and Wedderburn, 1972, 
and McCullagh and Nelder, 1990): Conditionally on the random vectors {6 j} 
the outcomes {yij} have a specified distribution (e.g., Poisson, gamma, or f), 
and 

MEQ/d*;)} = XijP + ZijStSj, (4) 

where h is a link function. This model is a generalization of (2); the gener- 
alization of (3) is analogous. In addition to the design matrix for variation, 
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given by the variables z and the matrix S, it suffices to specify only the link 
function h and the dependence of the conditional variance vax{yij\&j) on the 
conditional mean E(y,j|£j). 

The model (4) with the identity link h(x) = x and the normal distribu- 
tion, 

Vij = XijjS + SBy-S^j+Ctfi ( 5 ) 

where c {j ~ N(0,o*g), represents a special case. We will refer to (5), with 
al = 1, as the (normal) parent model for (2), and similarly define the parent 
models for (3) and (4). 

There are several methods for maximum likelihood estimation in the 
model (5). For earlier results see Hartley and Rao (1967), Patterson and 
Thompson (1971), and for comprehensive reviews, Searle (1971) and Harville 
(1977). Dempster, Laird and Tsutakawa (1981) describe an application of 
the EM algorithm, Jennrich and Schluchter (1986) and Longford (1987) give 
details of Newton-Raphson and Fisher scoring procedures, and Goldstein 
(1986) describes a generalized least squares algorithm. The EM algorithm 
tends to require a substantially higher number of iterations than the other 
methods, even when it is combined with routines designed tr accelerate con- 
vergence; see Thompson and Meyer t "^6) and Lindstrom and Bates (1988) 
for more detailed discussion. An iteracion of the EM algorithm and one of 



the Fisher scoring algorithm are of comparable complexity, since both require 
computation of certain quadratic forms in the inverses of the unconditional 
variance matrices var(y ; ) = o$l + ZjJSZj, where y ; = (yij, . . .,y nj j) T and 
Zj = (zj"j, ...,zl }j ) T . Lange and Laird (1989) describe some special cases for 
which maximum likelihood solutions can be expressed in a closed form. An 
important subset of these cases can be described as having balanced design 
for the random part, i.e., Zj are identical across the clusters. 

Relative simplicity of the algorithms for fitting the normal model (5) can 
be attributed to the existence of closed form expressions for the conditional 
moments of the random terms given the outcomes {yij} (for the EM algo- 
rithm), and of the log-likelihood and its partial derivatives (for Fisher scoring, 
Newton- Raphson or generalized least squares methods). 

For the logistic regression model (2) the joint likelihood for {y t j} involves 
normal integrals; 

7 = Slog/ ... /PiCW^W ( = £';), (6) 

i i 

where Pj(6j) = {P<i(*i)> y ° i 1 -Pii(*i)> 1 ~ V<> is the conditional likelihood 
for cluster j given Pij{6j) = logft-^Xy/J+JtyJ?^), and # r is the density 
function of N r (0,I). Direct maximization of (6) can be accomplished by using 
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Gaussian quadrature, although with three or more dimensional integrals this 
may be computationally very extensive. 

The EM algorithm avoids evaluation of the integrals in (6), although the 
conditional moments of the random effects 6 j involve integrals, and so there 
appears to be no gain in computational efficiency; on the contrary, owing 
to slow convergence of the EM algorithm the latter integrals have to be 
evaluated a larger number of times than the integrals in a direct maximization 
routine. Laird and Ware (1982) and Stiratelli, Laird and Ware (1984) replace 
the conditional expectations required for the EM algorithm by the conditional 
modes, thus reducing the computational load somewhat. See also Anderson 
and Aitkin (19S5) for discussion of the EM algorithm in this context. 

The intractable form of the log-likelihood (6) has until recently effectively 
discouraged application of the normal-mixture model (2), and the beta- 
binomial model (Williams, 1982), in which variation of the within-cluster 
(i.e., conditional) probabilities is modelled by a beta distribution, has been 
preferred. Simplification of the corresponding likelihood takes place since 
the beta distribution is the conjugate for the binomial distribution. A dis- 
tinct disadvantage of this approach for applications where between-cluster 
variation is of substantive interest is that the scale of the mixing beta distri- 
bution is difficult to relate to the more familiar logit or probit scales. Also, 
the method is specific to binomial data, although other familiar distributions 
have their conjugates. Rosner (1984) and Prentice (1986) discuss and extend 
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the results of Williams (1982). Breslow (1984) adapts the idea for clustered 
Poisson data. 

Bonney (1987) and Connolly and Liang (1988) define a class of logistic 
regression models for non-independent observations by the conditional dis- 
tributions of the individual outcomes given the rest of the outcomes in the 
cluster (or a subset thereof). Although this development is most natural for 
times series or longitudinal data it is equally suitable for situations with a 
symmetric pattern of dependence. 

The generalized estimating equations approach (GEE) of Zeger and Liang 
(1986) and Liang and Zeger (1986) is based on a generalized least squares 
type estimator for the regression parameters 

j9 = (X T V" 1 X)- i X T V- 1 y, (?) 

where X is the design matrix, consisting of the rows x,-j, in lexicographic 
order, y is the corresponding vector of outcomes, and V is a variance ma- 
trix, a function of certain parameters. Zeger and Liang (1986) propose naive 
estimators for these parameters; Prentice (1988) discusses a class of more 
efficient estimators. Thus model fitting by the GEE approach involves iter- 
ations of (7), with the parameters in V replaced by their current estimates, 
and an updating of the parameters involved in V. 

In contrast to the other methods reviewed above, GEE approach does not 
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arise from a model that describes how the data are generated. This renders 
assessment of the assumed model (model checking) difficult, especially for 
small samples. Nevertheless the GEE approach has two important virtues: 
computational simplicity, and that it caters for a general class of distribu- 
tional assumptions as well as link functions. A similar framework, based on 
quasilikelihood, is proposed by McCullagh and Nelder (1990). 

For logistic regression Zeger and Liang (1986) propose a general covari- 
ance structure 

V = {R(p)}*V 0 {R(p)}*, (8) 

where V 0 is the diagonal matrix, V 0 = diag{p t j(l-p t j)}, Pij = logit^x,-^), 
and R(p) is a block-diagonal correlation matrix, with blocks Rj(p) corre- 
sponding to the clusters. The simplest non-trivial choice for R(p) is the 
equicorrelation structure, 

R;(/>) = (1 ( 9 ) 

where I n and J n are the identity matrix and the matrix of ones, of size n x n, 
respectively. Zeger and Liang (1986) and Zhao and Prentice (1990) discuss 
estimation procedures for p. Note that R{p) in (9) is non-negative definite 
for p > — 1/ max(nj — 1); negative correlations can be realized when there is 
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an upper limit for the cluster sizes. 

Alternative parametrizations for R include sequential dependence, e.g., 
tridiagonal Rj or tridiagonal RJ 1 , applicable for longitudinal analysis; see 
Zeger, Liang, and Albert (1988), Zeger and Qadish (1988) and Zhao and 
Prentice (1990) for examples. 

An exact maximum likelihood procedure for the model (1) has been pro- 
posed by Anderson and Aitkin (1985), and in a more general context (two- 
way nested layout) by Irn and Gianola (1988). In Section 3 we describe an 
alternative to these methods based on the Newton-Raphson method. All 
these algorithms rely on numerical integration. 

3 Exact maximum likelihood 

In principle, evaluation of the integrals in (6), as well as of their partial 
derivatives, can be accomplished by Gaussian quadrature- This is feasible 
in practice for models with simple structure of between-cluster variation, 
that is, with a variance matrix E in (2) of low dimension, and then direct 
maximization of (6) is relatively straightforward. For the model (1) we have 

ft] r+oo 

|i = 5>PH;) JMWWW ( 10 ) 

and 
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where lj is the j th summand defined in (6), 

t 

and Sji is the first component of Sj (corresponding to the intercept 1). The 
second-order partial derivatives of (6) are 



d 2 l_ _ ^dlj dlj 



J — CO 

(12) 



d 2 i = j^a/j dij 



J —CO 

(13) 



and 

a 2 / = ^/^V 
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+ •£ expC-/,) / °° />■(«) {zJW i (5)z i - sl(S)} 8* 9{S)d6, 

• (14) 

where Wj(6) is the diagonal matrix of the conditional variances, given 6j = 6, 
Wij{6) = p,j(<5){l - i = 1, . . . , nj, Xj is formed by vertical stacking of 
the row vectors and Zj is the Ttj x 1 vector of ones. The expressions (6) 
and (10) - (14) can be approximated by Gaussian quadrature and used in 
a Newton- Raphson maximization procedure. The generalized least squares 
solution, which corresponds to a = 0, is a suitable starting solution. An 
arbitrary positive number can be used as the starting solution for a (or a 2 ). 
In the examples discussed in Section 8 we set initially a = 1, usually much 
larger than the maximum likelihood solution; a more judicious choice for 
the starting a would save not more than one Newton-Raphson iteration. For 
several model fits for these datasets, and for the number of quadrature points 
in the range 3-11, the maximization procedure required 3-5 GLS iterations 
and 3-6 further iterations based on (10) - (14). 

It appeaxs that 5 quadrature points are sufficient for data with moderate 
number of clusters (say, up to 100), but for larger datasets a higher number 
of quadrature points is required, although 9 points are sufficient even for data 
with 2,000 clusters. Empirical evidence for these observations is provided by 
the analysis of two datasets (Section 8) as well as by the simulation study 
reported in Section 9. 
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4 An approximate Fisher scoring algorithm 

We denote the linear predictor 0y = x,j/3 and for brevity write p,j = Ptj(O), 
w»j = P*j(l -Ptj), w+j = E,^j, = (iftf ~Pij)/ w iji e i = (eii,...,€n ; i) T , 
Wj = (t^ij, . . . ,^'n,j) T - Note that e t j is the generalized residuals familiar 
from the generalized least squares method. 

The conditional log-likelihood for the cluster j has the Taylor expansion 
around S = 0 

\og{Pj(S)} « MPi^J + ^eJwi-i^w+i 

(a*)* 

~-"ir? w-i li=o - (15) 

If all but the first three terms of this expansion are ignored, we obtain 

/ « E lo s{^(°)}- Y los(27ror2) 

i 

= Etog{Pi(o)}-|E^« + TS i2 ^. ( 16 ) 

j L j L i 9 J 

where g j = 1 + cr 2 w + j. 

An approximate maximum likelihood estimator for the parameters /3 
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and a 2 can be defined as the rnaximizer of (16) in the parameter space 
(-00, +oo) p x [0, +00). If we ignore the dependence of Wj on /3 we have 



~ « X T V~ l e, (17) 

a/3 



and 

m - » X T V^ 1 X, (18) 
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where X and e are formed by stacking of {Xj} and {e^}, respectively, 
and V is the block-diagonal matrix with blocks Vj = Wj 1 + <t 2 J„ ; , and 
Wj = Wj(0). See Longford (1988) for derivation of the approximations (17) 
and (18). The approximations imply the generalized least squares estimator 
(7), although with a parametric form for V different from (9): Whereas in the 
GEE approach, (9), equal correlations of the Pearson residuals are assumed, 
our development leads to equal covariances of the generalized residuals. We 
will refer to the estimator based on the latter as the AML (approximate max- 
imum likelihood) estimator. Note that some improvement in the formulae 
(17) and (18) can be achieved by exact differentiation of (16). 

If the linear predictor 6 = x/3 is constant then the GEE and AML es- 
timators of {3 (= /?i) coincide because the corresponding variance matrices 
V are identical. In this case the parameters p and a 2 are related by the 
identity p = <t 2 /(iv + <t 2 ), where w is the common value of all Wij. In general, 
the two parametrizations are more different the larger the variation of the 
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linear predictors 0y. For independent observations (/> = 0, or a = 0) both 
approaches result in the generalized least squares (GLS) estimator. Thus we 
can expect the GEE and AML estimators to perform well for small depar- 
tures from independence of the observations. However, for high correlations 
(covariances) the variance matrices V used for obtaining these two estimators 
have unrealistic properties. For GEE, contrary to expectations, the modelled 
variances of the observations (the diagonal of V) do not depend on p\ the 
within-cluster correlation does not inflate the variances of the observations. 
For AML these variances are inflated by {crwij) 2 , and for large a some of 
the variances Wij(l + <? 2 Wij) may be greater than |, the maximum variance 
of binary data. Thus caution should be exercised when interpreting fitted 
variances for the observations based on the estimated correlation p or the 
variance a 2 . 

The first-order partial derivative of (16) with respect to a 2 is 

W ~ 2 V ft + 2tV ft ) ' ( ] 

and the second-order partial derivative is 

Using the approximate identity var(ejw ; ) = w+jg, we obtain 
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and by similar operations it can be shown that 'E{d 2 l/(d/3da 2 )} ^ 0 . 

5 Extensions 

In this Section we consider extensions for the AML approach parallel to the 
extensions of the basic GEE model (8) - (9). 

First, the development (15) - (16) for the general model (2) yields the 
approximation 

/ * [-N Iog(2ir)- log det(G i ) + 2j>g{P i (0)} 

i 

+ E e I w ; z ;'^ G 7 lz I w ; e i]/ 2 ' ( 22 ) 

i 

where N = rii + . . • + n# 2 is the sample size, Gj = 1+ ZjWjZjl?, and so it 
corresponds to the choice Vj = W^ + Zj^Zj in (17) and (18). Times series 
patterns of dependence can be implemented only by specifying the form of 
Vj, in complete analogy with the GEE approach. 

5.1 Nested two-way layout 

Suppose the individual observations, indexed ijk y are in clusters jk, and the 
clusters are contained in areas k = 1, 2, ... , N$. We will use the notation of 
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the previous sections, with the additional subscript k denoting the area. For 
the model (3) let 



and 

V M = diag(V iM ) + Z ka S 2 Z T kt2 . (23) 

Note that V = diag( Vjt, 2 ) would be the natural choice for the variance matrix 
in (17) and (13). Appendix contains a derivation of the following approxi- 
mation for the density for the nested two-way layout in the context of the 
generalized linear models, 

/ « \{ - N log(27r) - EE lo S det G iM ~ !>§ det G M 

1 k j * 

+ EeJVjJZ M I7 2 G^ 2 Zj 2 V^e*} + EElog{Pi(0)}, (24) 
it * j 

where G iM = 1 + Zj u W iU Z iM 2?i and G* |2 = 1+ Z^V^Zm^, and e* 
is the vector of generalized residuals for the observations in the area k. Note 
that the inverse of the matrix Vjk,i can be expressed in terms of the matrix 

Differentiation of (24) with respect to /3, while ignoring the dependence 
of Vjt,2, Gjjfc.i and Gjt,2 on /3, leads to the estimator given by (17) and (18), 
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with the blocks of V given by (23). For a general parameter £ involved in 
one of the variance matrices, Si or S 2 , we have 



* * ~2 tr ( v + 2? e * v w-ar v « e * (25) 

and 

Proof of (25) and (26) is contained in the Appendix. 

5.2 Generalized linear models 

The Taylor expansion (15) can be applied in the much more general context 
of (conditional) generalized linear models defined by (4), or its extension for 
two-way layout. Details are given in the Appendix. 

The normal model (5) with the identity link is a special case of these 
conditional GLM models. An exact maximum likelihood solution for this 
model is obtained by the iterative procedure defined by the equations (7), 
(17) - (19) and (21) with Vj = <r 2 I + Zj-£Zj. An important implication 
of this is that a maximum likelihood algorithm for fitting the normal model 
(5) can be adapted for the conditional GLM model (4) by replacing all the 
crossproducts required for evaluation of the normal log-likelihood (and of its 
partial derivatives) by its weighted versions, with the weights defined by the 
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applied link and (conditional) variance functions. 

In the normal case the additional scale parameter a 2 = var(e) can be 
estimated as the stationary point of the equation 

N = e T V- 1 e. (27) 

It is advantageous to use the reparametrization ft = a~ 2 <E\ in which case 
(27) is obtained by solving the normal equation for er 2 . Then a 2 = e T U e, 
where U = I + diag(Zj27Zj) does not depend on <r 2 . 

6 Restricted maximum likelihood 

In the normal case the maximum likelihood estimator for the variance matrix 
S is known to be biased, and an unbiased estimator is obtained by maximiz- 
ing the likelihood corresponding to the N - p error contrasts orthogonal to 
the regressor space; see Patterson and Thompson (1971), or Harville (1977), 
for detailed discussion. Harville (1974) has derived an explicit form for this 
restricted log-likelihood (RML). The full and restricted log-likelihoods differ 
by a constant and the term 

R = -ilog{det(X T V- 1 x)}. (28) 

The formulae for the Fisher scoring algorithm can be straightforwardly ad- 
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justed for RML. 

Stiratelli, Laird and Ware (1984) define the restricted log-likelihood for 
logistic mixed regression with random coefficients by integrating out the re- 
gression parameters /3 using a flat prior. RML is most important for models 
with relatively many regression parameters, and maximization of the likeli- 
hood may be computationally demanding exactly in such cases. 

We propose to approximate the restricted log-likelihood by the adjust- 
ment (28), with the variance matrix V, given by (17) or (23), as appro- 
priate, to the respective log-likelihood (16) or (24), or indeed to the exact 
log-likelihood. This proposal has no theoretical foundation, it is based only 
on analogy with the normal case. The simulations reported in Section 9 show 
that the maximum likelihood estimator for the between-cluster variance in 
logistic regression is downward biased, and that the restricted maximum like- 
lihood adjustment reduces its bias (in the simulated situation). 

7 Bias of the GLS estimator 

An important practical issue is related to the performance of the GLS es- 
timator of the regression parameters in presence of positive between-cluster 
variation. In the context of the normal models this issue has been discussed 
by Holt and Scott (1982). The principal question is that of the bias of the 
estimators for the standard error. Since the approximation for the asymp- 
totic information matrix for /3 in logistic regression has a form similar to its 
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exact counterpart for the normal regression, a discussion similar to that of 
Holt and Scott (1982) can be conducted. For illustration we assume a simple 
logistic regression model 



logit{P(y« = - a + Pxij + aS^ 



(29) 



and we associate this model with its parent normal model 



(30) 



with var(cij) = 1, and common design matrix X, regression parameters (a,/?) 
and variance cr 2 = var(£j) for the two models. We denote = var(y|j V) ) in 
(29) and Vb the generalized variance matrix for (30). Thus the information 
about (a,/?) in (29) is approximated by (F B =) X^^X and is equal to 
(Fn =) X T V^X in (30). We have 



1< b = 



X T WX - a 2 £ XjW jZj zJWjXj/gj 

i 

F,- — — x 1 Wx-^i — 2 — — 



(31) 



where Xj (x) is the second column of Xj (X), Zj is a vector of length n,-, and 
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6 < 



£ J = = 1 + cr 2 u; + j. In the corresponding formula for the normal model (30) the 
role of tu+j and W is taken by the number of observations in the cluster j 
(rij), and the Tij x rij identity matrix, respectively. 

If we additionally assume that the average variance, (c =) w+j/nj, is 
constant within the clusters, then the information matrix F/v is equal to the 
information Fb for the dataset in which the design matrices Xj are replicated 
1/c times in each cluster. Since c < \ we see that clustering has a much 
reduced impact on the bias of the estimators of the standard errors. This 
explains the apparent redundancy of the maximum likelihood methods over 
the GLS method in the interviewer variability example, and of the sample 
data analysis in the analysis of death rates in U.S. hospitals in Section 8. 

The comparison of the logistic regression models with their parent models 
carries over to more complex patterns of variation; for instance, for the model 

lofft{P(y« = = aj + b jXij (32) 

with (aj,6j) ~ N 2 (/3, S) the approximate information matrix for /3 is equal 
to Ej GfXjXj, where G, = I + XjWXjE. 

Information about the between-cluster variance a 1 in the model (1) and 
its parent normal model permit a similar discussion. The information in the 
former is approximated by £j w+j/(l + <r 2 w+j) 2 y and in the normal case it 
is equal to £j n j/(! + <r 2 nj) 2 . This implies that the binary models contain 
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relatively little information about a 2 , unless there are many clusters with 
large totals w+j, and/or a 2 is small. In the sense of the comparison above the 
information about cr 2 in (29) is comparable with the parent normal model (30) 
with rij/w+j times fewer observations in each cluster. Note that n/(l + a 2 n) 
is an increasing function of rc, but its derivative is smaller than 1. Thus 
larger clusters contain more information about a 2 , but the increase is slower 
than when the same number of observations is added to the sample in new 
clusters. 

8 Examples 

8.1 Interviewer variability in an attitudinal survey 

We illustrate the methods using a dataset kindly provided by Professor Shure 
from the Department of Psychology, UCLA. The data were collected in a 
survey of public awareness of political issues. The outcome variable is the re- 
spondent's perception of the government's role in his/her life, originally coded 
on the scale 1-5. For purposes of illustration we consider the dichotomous 
variable generated from this multinomial variable by recoding the values 1 
and 2 into 0 and the values 3-5 into 1. There is one explanatory variable for 
the subjects (their gender, RSEX, coded 0 for females and 1 for males), and 
three variables defined for interviewers: their gender, ISEX, political opinion, 
IPOL, on the scale 1-4, from liberal to conservative, and (self-rated) con- 
cern for others, ICON, on the scale 1 - 3. Although the latter two variables 
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involve ordered categories, we will regard them throughout as quantitative 
variables, so as to reduce the number of estimated parameters and to sim- 
plify the discussion. The data contain records of 1008 respondents, each of 
them having been interviewed by one of the 40 interviewers. The workloads 
of the interviewers (numbers of respondents for each interviewer) vary be- 
tween 12 and 85, although 32 interviewers have workloads smaller than 30 
(see Table 1). Preliminary analysis as well as prior information imply that 
the between-cluster variance is likely to be very small, and possibly equal 
to zero, especially if differences in the interviewer-attributes are taken into 
account. This is largely confirmed by our analysis, although the size of the 
sampling variance of the estimate of a 2 provides only weak evidence against 
large values of a 2 . 

Results of model fits using the generalized least squares (GLS), the ap- 
proximate maximum likelihood (AML), the restricted approximate maximum 
likelihood (RAML), the generalized estimating equations (GEE), the max- 
imum likelihood using 3-, 5- and 9-point Gaussian quadrature (ML3, MLS 
and ML9), and the restricted maximum likelihood using 9-point quadrature 
(REML) are displayed in Table 2. 

The estimates and their estimated standard errors obtained by using 5-, 
7- and 9-point quadrature differ by less than 10~ 4 (results for the 7-point 
quadrature are omitted), and for practical purposes it would suffice to use 
3-point quadrature. The AML estimates and the standard errors for the 
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regression parameters are very close to the corresponding values for ML. The 
largest discrepancies, though still trivial, occur for GLS and for RAML. The 
discrepancies of the former are to be expected, in analogy with the (parent) 
normal case. 

The estimate of the variance a 2 in AML closely reproduces its ML coun- 
terpart, although the RAML and REML obtain substantially inflated es- 
timates of the variance. The inflation factors are much higher than what 
could be expected by considering the number of regression parameters and 
the number of interviewers. The estimated standard error for the estimate 
of the standard deviation cr is greater than its AML counterpart but the es- 
timated standard error for the estimates of the variance cr 2 is much smaller. 
This 'paradox' is due to the strong dependence of the information about o 
and a 2 on a 2 . 

Figure 1 contains the plot of the profiles of the various approximations 
to the log-likelihood as functions of cr, and Figure 2 the same plots on the 
variance scale. It is preferable to derive naive confidence intervals based on 
standard errors for the variance, although in a wider range of values of a 2 
the dependence on the standard deviation a appears to be much closer to a 
quadratic curve. Figures 1 and 2 also demonstrate that all the approxima- 
tions to the log-likelihood, except ML3, are good for cr 2 in a wide range of 
realistic values. 

The dependence of the restricted maximum likelihood deviances on the 
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variance closely resembles that for the full maximum likelihood. The cor- 
rection term (28) varies insubstantially as a function of a 2 . The difference 
between AML (ML) and RAML (REML) estimates of <r is much smaller 
when fewer explanatory variables are used. 

The GEE estimate of the 'working' correlation cannot be directly com- 
pared with the estimates of a or a 2 , because it refers to a different scale. 
Note, however, that all the regression estimates are rather small, and so 
there is little variation in the fitted values x/3. 

An approximate likelihood ratio test for the hypothesis of zero between- 
cluster variance can be carried out by comparing the values of -21og-likelihood 
(deviance) : or their approximations, for the models with estimated a 2 and 
the corresponding generalized linear model (assuming a 2 = 0). In our case 
the difference of the deviances is approximately 0.10 for all methods, except 
RAML and REML, where it is equal to 0.66. Note that the RAML and 
REML deviances for the submodel with a 2 = 0 is adjusted by the term (24); 
the RAML deviances cannot be compared with any deviances that refer to 
full maximum likelihood. 

The iterations were terminated when the norm of the correction for the 
estimated parameters was smaller than 10" 4 and the change in the value 
of the approximation to the log-likelihood (6) was smaller than 10' 3 . Each 
method required three or four iterations, although all the methods except 
GLS use the GLS model fit as the starting solution. The times given in 
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Table 2 are the physical times (in seconds) that elapsed while fitting the 
model by the respective methods. The times for all the methods apart from 
GLS include the time taken to fit the GLS as the starting solution. Thus an 
iteration of AML, RAML, or GEE took 1.7 - 2.5 seconds. The time required 
for fitting ML increases with the number of quadrature points, about 10.5 
seconds per point. All the model fits were carried out on an IBM/PC using 
the GAUSS software. 

There appeax to be consistent differences among male and female in- 
terviewees (the corresponding t-ratios are about 1.8), but the interviewer's 
attributes are not significant. Note however, that the estimated regression 
parameters, if taken at face value, are quite large: For a given respondent 
the logit of his/her response could differ by as much as 0.5 for a pair of inter- 
viewers with different sexes and extreme values of the attribute ICON. Even 
though there are 40 interviewers, the substantive conclusion of the analysis 
is that the data contain little information about interviewer variability (and 
even less information about the various pairwise comparisons of the inter- 
viewers). The estimates of the corresponding standard deviation a are about 
0.1, but even the value of 0.25 is quite feasible. Differences associated with 
such variation can be easily translated to the probability scale, and are in 
the present context substantial. 
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8.2 Death rates of Medicare patients at U.S. hospitals 

The Health Care Finance Administration (HCFA) publishes annual reports 
giving for each acute care hospital in the U.S.A. the number of patients 
treated during the year and the number of deaths among those over 65 years 
insured by the Medicare system. The data are given for each of 14 diag- 
nostic categories. The within-hospital death rates are compared with the 
national death rate for each diagnostic category and statistically significant 
comparisons (at a nominal 5% level) pointed out. 

Clearly such a system of monitoring the quality of health care has a num- 
ber of problems, including the choice of the outcome (died within 30 days of 
admittance, or survived), but one addressable issue is that of the risk asso- 
ciated with a patient's (hypothetical) selection of the hospital. Adjustment 
for the health condition of the patient at the time of admittance appears 
to be essential, but the'relevant data, which consists of various measures of 
severity of the condition, is very costly to obtain because it requires time 
consuming abstracting from medical records by qualified staif. 

As part of a large study assessing the quality of care under the Medicare 
Prospective Payment System (Kahn et al., 1990), a stratified probability 
sample of 297 hospitals was selected from the list of all U.S. acute care 
general hospitals active during the years 1981/82 and 1985/86. A set of 
standardized variables measuring patient's severity of condition at admission 
was extracted from a small number of randomly selected patient records (3-4 
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patients from each hospital and each time period per disease). An assessment 
of the quality of the medical care given to each patient was also performed. 
These assessments were coded in a quantitative variable called PROCESS. 
The age of each patient was also considered as a predictor variable. 

For the complete national data for fiscal year 1986, analyzed by Jencks 
et al. (1988) using an alternative method, we consider the logistic random- 
effects analysis of variance model 

logit{P(y ti = 1 | Sj)} = p + aSj. (33) 

The estimates and standard errors for the parameters fi and a 2 for four diag- 
nostic conditions with high mortality rates are given in Table 3. We see that 
the data contain abundant information about between-hospital variation; 
each estimated variance is highly significant (using the t- or the likelihood ra- 
tio test). The estimated variances are substantial; for example, a pneumonia 
patient has an estimated probability of survival 1/{1 + exp(— 1.57 — 0.28)} = 
0.864 in a hospital with Sj = -1, and 1/{1 + exp(-1.57 + 0.28)} = 0.784 in 
a hospital with Sj = +1. 

Unlike for the previous example, the estimate of the standard error of fi 
is affected by the between-cluster variance quite substantially; the ratios of 
the standard errors for ML and for GLS are in the range 1.3 — 1.5. This is a 
consequence of the data containing a large number of large clusters (several 



31 



hospitals admit annually more than 1,000 patients for a specific condition). 

Adjustment for the explanatory variables would be expected to reduce the 
amount of between-hospital variation. We discuss here only the analysis for 
pneumonia. The death rate, not adjusted for severity, of the patients selected 
into the 1981/82 sample was 0.148 (1216 patients with mortality data), and 
the estimate of the variance <r 2 corresponding to the random effects ANOVA 
model (33) is 0.0960 (standard error 0.1610). In contrast, the death rate 
for the 1985/86 sample is 0.171 (1320 patients with mortality data), but the 
fitted variance is negative, equal to -0.0030 (standard error 0.1240). 

Results for some of the logistic regressions for the survey data are dis- 
played in Tables 4 (1981/82) and 5 (1985/86). Patient's severity is repre- 
sented by the 11 variables used in Jencks et al. (1988), of which the vari- 
able APACHE II (Knaus, Draper and Wagner, 1985), is the most impor- 
tant. There are five stratifying variables (including dummy variables that 
categorize hospitals according to size), and the variable PROCESS is also 
considered. The variables APACHE II and PROCESS have been linearly 
rescaled to have mean zero and standard deviation 1. The ML estimates 
for the between-hospital variance are negative (zero) for most models for the 
1985/86 data, whereas for 1981/82 they are positive, with exception of the 
model with adjustment for all the variables. Adjustment for severity appears 
to decrease the estimate of the between-cluster variance, while adjustment 
for process has the opposite effect, though to a much lesser degree. However, 
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the estimated standard errors associated with these estimates of a 2 axe so 
large that we have little confidence that severity adjustment does reduce the 
differences among the hospitals. The data with all the severity measures 
have too little information about between-cluster variation. The negative 
estimated variances are clearly unrealistic; in fact they could not be realized 
in hospitals with more than about 100 patients. 

It would seem that larger samples are required to obtain a meaningful 
estimate for <x 2 . The approximate information for cr 2 given by (21) can be 
used to decide about the survey design that would improve or optimize the 
information about a 2 . Suppose, for simplicity, that the same number of pa- 
tients, n, is to be sampled from each selected hospital, and a total of N 
patients will be selected to the survey. The approximate information for a 2 
is equal to ^Nnw 2 /(1 + nwa 2 ), whereto = p(l-p) is the common conditional 
variance (given 6j = 0), and for fixed JV, w and a 2 it has a unique maximum 
for n* = 1/wcr 2 . For pneumonia we have w « 0.15 and almost certainly 
a 2 < 0.3, and so it is very likely that n* > 20. This suggests that a design 
with fewer hospitals and more patients from each selected hospital would be 
much more informative about <x 2 . However, design with independent obser- 
vations (one patient per hospital) is most informative about the regression 
parameters. A suitable tradeoff is likely to be closer to the design with fewer 
sampled hospitals. 
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9 Simulations 

The methods discussed in Sections 3 and 4 were compared in a simulation 
study. Data were generated according to the simple logistic regression model 
with a random intercept: 

logit{P(y« = l|*i)} = a + Pxy + vSj, (34) 

with 40 clusters (j), four of them containing 21, 22, . . . , 30 observations each. 
The regression parameters were set to a = 0 and /? = 1. The values of the 
regressor x were drawn from the uniform distributions on (—1, 1) for one set 
of 100 simulations, and on (1,3) for another set of 100 simulations. The 
(regression) designs are referred to as U(-l,l) and U(l,3). The values of the 
standard deviation a were set to 0, 0.1, . . . , 1 in each a set of simulations. 

The purpose of the simulations was to compare the estimators of the pa- 
rameters a, /?, and a (cr 2 ), and the standard errors of these estimators, to 
assess the importance of taking account of between-cluster variation in esti- 
mating a and /3, as well as to compare the computational complexity of the 
methods. Of interest is the agreement of the AML, RAML and GEE esti- 
mators with their exact ML counterparts and the accuracy of the estimated 
standard errors for predicting the observed mean squared errors. 

We summarize results separately for each estimated quantity (parameter 
or standard error): 
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1. Slope (3: The biases of the GLS, AML, RAML, ML9 and REML9 esti- 
mators of the slope ft are plotted in Figures 3 and 4 for the resf rive designs 
U(-l,l) and U(l,3). The AML, RAML and GLS estimators of the slope have 
nearly identical means, and their biases are essentially a decreasing function 
of the variance cr 2 . The GEE estimator is essentially indistinguishable from 
the AML estimator, and is therefore omitted from the plots. The ML9 and 
REML9 estimators are also nearly identical, but their biases are much smaller 
than those of the GLS, AML and RAML estimators. However, the difference 
in the biases is substantial only for a > 0.4 for U(-l,l), and a > 0.7 for 
U(l,3). It is rather counterintuitive that the bias in much smaller in the 
U(l,3) design which contains much less information about the slope. 

2. Variance a 2 : The means of the AML, RAML, ML9 and REML9 
estimators of the variance a 2 are plotted in Figures 5 and 6 for the respective 
designs U(-l,l) and U(i,3). The bias of the REML9 estimator is negligible 
for up to a = 0.9. The ML9 and REML9 have comparable biases, and the 
bias of the AML estimator is the largest. However, the bias of the latter is still 
ignorable for a 2 < 0.4 in both designs. Again the bias of all four estimators is 
much smaller for the U(l,3) design which contains less information about a 2 . 
Note that in ML and REML methods the standard deviations are estimated, 
and so the corresponding estimator of the variance is nonnegative. As a 
result these estimators of the variance have a positive bias for small values 
of o* 2 . 
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3. Mean squared error of the estimators of the slope: Figures 7 and 8 
contain the plots of the mean squared errors for the five studied estimators. 
The GLS estimator is only slightly less efficient than its AML and RAML 
counterparts for the U(-l,l) design, and its efficiency breaks down only for 
a > 0.5 in the U(l,3) design. The ML9 and REML9 estimators of /3 are sub- 
stantially more efficient for large values of a in U(-l,l) design (the efficiency 
reaches about 135% for a = 1), but they are less efficient than the AML and 
RAML estimators for the U(l,3) design throughout the range of a. Note that 
for the U(-l,l) dt-ign the mean squared error is an increasing function of a 
for a > 0.3, but in the U(l,3) no such trend can be observed. 

4. Standard errors for the estimators of the variance a 2 : AML and 
RAML methods estimate the variance c 2 , and in our implementation nega- 
tive estimates of the variance were allowed. These methods can be adapted 
for estimation of the standard deviation c in the obvious way, and the corre- 
sponding standard error can be calculated by application of the chain rule. 
However, the sign of the value of the estimate is not determined, and therefore 
the definition of the corresponding mean squared error is ambiguous. The 
ML9 and REML9 methods estimate the standard deviation <r, from which an 
estimator of a 2 can be defined, but it does not allow negative values of the 
variance. The sampling distribution of the estimators of a would be expected 
to be a censored normal. The histogram of the 48 positive estimates of a 
by AML method in U(-l,l) design (Figure 9) provides strong evidence to 
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the contrary. Substantially smaller proportion of the estimates of cr are close 
to 0 than would be expected. Generally, the distributions of the variance 
estimators appear to resemble the normal distribution more closely than the 
estimators of standard deviation, although for large values of the variance 
the difference is not substantial. We therefore compare the ML9 and REML9 
estimators for the standard errors with their AML and RAML counterparts 
only for a > 0.5, where no negative estimates of a 2 have occurred. Figures 
10 and 11 contain plots of the means of the four estimators of standard er- 
rors. The correction for bias appears to increase the standard errors, though 
only marginally. The approximate methods, AML and RAML, yield more 
efficient estimators than their exact counterparts, ML9 and REML9, though 
the difference is insubstantial. 

We have replicated a small sample of the simulations of the ML9 estima- 
tor using 7- and 11- point Gaussian quadrature. No corresponding sets of 
estimated quantities (parameters and standard errors) differed by more than 
10~ 4 . 

The value of the deviance (-2 log-likelihood) was obtained for each method. 
It provides information about the quality and power of the (approximate) 
likelihood ratio tests for the hypothesis of independence (cr 2 = 0). The sam- 
pling distributions for the corresponding deviance differences for all four 
methods closely resemble the x\ distribution. Figure 12 contains the cor- 
responding qq-plot for the AML method, U(-l,l) design. 
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The naive t-ratio test for zero variance closely agrees with the likelihood 
ratio criterion. The power of these tests is very low; the observed power 
of the likelihood ratio test for the null hypothesis of a 2 = 0 when the true 
variance is, say, 0.09, is only 44% for the U(-l,l) design, and 19% for the 
U(l,3) design (at the 5% level of significance). Clearly, with such designs 
there is little scope for modelling more complex covariance structures than 
the equicovariance one. 

10 Discussion 

The approximation to the log-likelihood and its partial derivatives for logistic 
regression with random coefficients provides an alternative derivation of the 
GEE approach of Liang and Zeger (1986). The approximation highlights 
the problematic nature of both methods of estimation when the estimated 
between-cluster variation is large. 

The loss of information about the regression parameters attributable to 
clustering in logistic regression is much smaller than in the normal regression, 
and, in addition, to the cluster sizes it depends on the distribution of the 
predicted variances u>,j. 

The AML approach discussed in this paper is a model- and estimation- 
framework parallel with the GEE approach of Liang and Zeger (1986). In 
addition, the AML approach is supported with an approximation to the 
log-likelihood, and for random regression coefficients, with a model descrip- 
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tion which enables data simulation, and in principle, detailed model diag- 
nostic procedures, such as those described by Pregibon (1981). Also the 
parametrization implied by these models is a natural one, and is easy to 
transform from the linear scale to the scale of the outcomes, such as to the 
probabilities in logistic regression. 
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Appendix 

Approximation for the likelihood of the ran- 
dom effects model with non-normally distri- 
buted outcomes 

We consider the density of a distribution in the exponential class 

/(y,0,r) = exp[a(r){y0-6(0)} + c(y;r)], (1) 

in which the parameter 6 is a function of the linear predictor x/3. We will 
use the following expansion: 

y6{x0 + zS) - b{6{x(3 + z6)} 

« y${xP) - b{6{xf3)} + z%0'(x/3) - V{6{xP)}6'(xP)] 
+ I(z6) 2 [t/0"(x/3) - 6"{<?(x/3)}{0(x/3)} 2 - &'{0(x/3)}0»(x/3)] 

Ao(x/3) + z6A a (x/3) - |(z6) 2 A 2 (x/3). (2) 

We assume that a(r) > 0 and A 2 {9) > 0 for all r and 0. Note that these 
functions are related to the variance of the distribution (1). 
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Suppose each observation i = 1, . . . , n of a cluster has the density f(yi,6i, r) 
with 0, = 0(xi(3 + z,-5), where x, and z, are given vectors and 6 ~ N r (0, Si) 
for a positive definite matrix E\ . The observations are assumed conditionally 
independent given 6. The joint density for the cluster can be approximated, 
using the expansion (2), as 

{(2?r) r det 27i}~* J ... jf[f{yiA^P + z,*),r}exp (-^ T 27 1 *) d6 

K 

« {(27r) r det SiH lir=i HViAxiPM 

/. . ./exp [a(r) {A^ - |6 T (Z T A 2 Z + S" 1 ^}] 
= nr=i/{y»^(x,/3),r}[{a(r)}"detG]-lexp{|a(r)e T e-|e T Vr 1 e} ) 

(3) 

where e = AiAj\ Vj = a" 1 (r)(A 2 + A 2 Z17 1 Z T A 2 ) ) 6 = 1, + Z T A 2 ZE U 
Ai = {AiCxi/3), . . . , AiCxn/3)}, A 2 = diag{A 2 (x 1 /3), . . . , A 2 (x„/3)}, and Z = 
(z[,...,z^) T . Note that 

V- 1 = a(r)(A 2 " 1 - ZE 1 G- 1 Z T ). 
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Next suppose there are clusters j = 1, . . . , N2 with rij observations each, with 
vectors of outcomes yj and design matrices Xj = (x^-, . . . >z^ j) T and Zj, and 
conditionally on 7 € R 5 each cluster has the (approximated) joint density 
(3) with 9(xi/3) replaced by dfcjfl + u^-jr), and let Uj be the rij x s matrix 
containing the rows uy. Suppose these clusters are conditionally independent 
and that 7 — N 5 (0, I7 2 ), where J£ 2 is a positive definite matrix. The joint 
density for these N = n j observations can be approximated, using (2), 



as 



F{Y,0(X/3),r} [{a(r)}"nSidet G,- 



{(2jr)Met 27 2 }"5 exp {\a(r) ^ eje,- - § E j e/V^e,} 



/ . . . / exp [a(r) {£; AyU^ - \l T (Ej UjA 2i U j + 2?, 1 ) nr}] *y 
= F{Y,0(X/3),r} [{a(r)} ;v nfiidetG i detH] exp {|a(r)d T d - |d T V 2 " 1 d} 
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where 

F{Y, 6>(X/3), r} = nil fiVii>*{*H0)> *h 
i « 

H = I + E i UjBjV~ 1 B 2 U i 17 2 , d = (e7,,..,e^) T , e if G i? A y , A 2j , and 
Vij are defined as the corresponding vectors or matrices e, G, Ai, A 2 , and Vi 
for cluster j, B a = (Aji, A a T 2 , . . . , A^)" 1 ", B 2 = diag(A^, Aj 2 ,. . . , Aj N2 ) T , 
and 

V 2 = diag(V li ) + a- 1 (r)B 2 U27 2 U T B 2 T , 



whereU-(Uj",...,U^ 2 ) T . 

Extensions to further layers of nesting are analogous; in the appropriately 
altered notation we hcive the following approximation for the joint density of 
a cluster of order k (containing Nk-i clusters of order fc - 1, each of these 
containing clusters of order k — 2, and so on): 



F{Y,©(X/3),r} 



k N k 

a(r)fnii det G ^ 



t=i 



It. 1 



exp l -e e 



ie T VJ l e), (5) 



where the (generalized variance) matrix V*, is defined recursively: 



V, = diagCVfc.!,) + A 2 Z fc 27 fc ZjA 2 , 
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and G fc = 1 + 23; ZiyVj^Zyl?*. Note that the product of the determinants 
in (5) is equal to detV. 

An approximate maximum likelihood estimator for all the unknown param- 
eters can be defined as the maximizer of (5). The first- and second-order 
partial derivatives of the logarithm of (5) with respect to the regression pa- 
rameters /3 have the approximations (17) and (18) (the dependence of the 
weights on the linear predictor has to be ignored). This implies the general- 
ized least squares estimator. The first-order partial derivative with respect 
to a covariance structure parameter 6 is 

and the expectation of the second-order partial derivative has the approxi- 
mation (setting E (ee T ) « V) 

The approximations (3) - (7) can be extended for singular matrices Si,S 2y 
. since the approximate joint density (5) can be analytically extended 
to the boundary of the parameter space for these variance matrices. Also, 
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extension for non-constant scale a(r) is trivial. 

Longford (1988) gives an approximation to the joint density for clustered 
observations, similar to the derivation presented here, with the conditional 
density (1) replaced by the extended quasilikelihood function of Nelder and 
Pregibon (1987). 
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Table 1: The workload of the interviewers; Interviewer Variability Example 



12 12 13 14 15 15 16 17 17 17 

18 18 18 19 19 20 20 20 21 21 

22 23 23 23 23 24 24 24 24 26 

26 28 30 31 36 42 45 47 60 85 



Table 2. Regression model fits to the interviewer data. 
The methods axe: generalized least squares (GLS), approximate maximum 
likelihood (AML), restricted approximate maximum likelihood (RAML), gen- 
eralized estimating equations (GEE), exact maximum likelihood using 3-, 5- 
and 9- point quadrature (ML3, ML5, and ML9, respectively), and restricted 
exact maximum likelihood (REML), using 9-point quadrature. The estimates 
are given in the columns 2-7 and the corresponding standard erors under- 
neath in parentheses. The deviances are the values of -2 log-likelihood. 

Legend: 

1 The RAML deviance for the GLS model fit 

2 Estimate of the working correlation (GEE) 
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Table 2: Regression model fits to the interviewer data. 







PA RAMETER ESTIM ATES 






MODEL FITTING INFORMATION 


METHOD 


Intercept 


RSEX 


IPOL 


ICON 


ISEX 


a 1 


a 


Iterations 


Deviance 


Comp. Time 


GLS 


-.5965 
(.2937) 


-.2705 
(.1515) 


-.0708 
(.0707) 


-.1793 
'(.1370) 


-.1142 
(.1671) 






4 


1071.152 


5.27 


A ML 


-.5906 
(.3041) 


-.2686 
(.1518) 


-.0736 
(.0729) 


-.1823 
(.1418) 


-.1128 
(.1740) 


.0136 
(.0486) 


.1166 
(.2085) 


3 


1071.055 


10.33 


RAML 


-.5817 
(.2937) 


-.2656 
(.1515) 


-.0779 
(.0707) 


-.1875 
(.1370) 


-.1101 
(.1671) 


.0391 
(.0524) 


.1976 
(.1327) 


4 


1092.459 
1093.114 1 


13.07 


GEE 


-.5900 
(.3380) 


-.2680 
(.1478) 


-.0735 
(.0743) 


-.1827 
(.1467) 


-.1132 
(.2051) 


.0324 2 




3 




12.85 


ML3 


-.5934 
(.3050) 


-.2692 
(.1519) 


-.0739 
(.0520) 


-.1825 
(.1423) 


-.1126 
(.1741) 




.1162 
(.1993) 


4 


1071.056 


59.59 


ML5 


-.5931 
(.3052) 


-.2692 
(.1519) 


-.0739 
(.0521) 


-.1826 
(.1424) 


-.1128 
(.1741) 




.1173 
(.2017) 


4 


1071.055 


80.68 


ML9 


-.5931 
(.3052) 


-.2692 
(.1519) 


-.0739 
(.0521) 


-.1826 
(.1424) 


-.1128 
(.1741) 




.1173 
(.2016) 


4 


1071.055 


122.75 



REMI, -.5841 .2661 -.0774 .1889 -.1130 .1953 4 1092.373 84.22 

(.2080) (.1508) (.0717) (.1386) (.1652) (.1360) 1093.024 



Table 3: Analysis of the complete national data for the four most frequent conditions. 
Standard errors are given in parentheses ( ) and the ML9 estimates in braces [ ]. The standard errors 
for the ML9 estimates are omitted (to conserve space); they differ from their counterparts by less than 
0.0002 for the mean logit, and less than 0.0006 for the variance cr 2 . 



DEATH RATES AND THEIR BETWEEN-HOSPITAL VARIATION 
COMPLETE NATIONAL DATA, FISCAL 1986 


l!nn fl i t inn <? 


Hospitals 


Patients 
Died 


Death 
rt (%) 


GLM 


AML 
fMLl 


a 2 


Lklh. 
ratio 


Pneumonia 


5628 


415,179 
77,961 


18.78 


-1.4645 
(0.0040) 


-1.5057 
[-1.4811] 
(0.0059) 


0.2830 
[0.2846] 
(0.0061) 


1689.03 
[1710.12] 


Heart 
Failure 


5541 


465,229 
71,223 


15.31 


-1.7106 
(0.0041) 


-1.7184 

[-1.7037] 

(0.0054) 


0.2109 
[0.2071] 
(0.0068) 


559.45 
[553.63] 


Stroke 


53 tO 


298,306 
60,617 


20.32 


-1.3664 
(0.0046) 


-0.13549 
[-1.3370] 
(0.0063) 


0.2598 
[0.2518] 
(0.0075) 


813.81 
[805.81] 


Heart 
Attack 


5285 


278,114 
70,942 


25.51 


-1.0717 
(0.0044) 


-1.0595 
[-1.0406] 
(0.0057) 


0.2085 
[0.2033] 
(0.0071) 


507.21 
[499.11] 



Table 4: Logistic regression for death rates (pneumonia), with adjustment for severity, stratification and 
PROCESS (as indicated in the rows), year 1981/82. 

HOSP denotes the random effects due to the hospitals, STRAT the stratifying variables, PROC the 
process variable, and SEVER the measures of severity (they include APACHE as a variable). The 
standard errors for a 2 and given in parentheses ( ), and the deviance corresponding to a 2 = 0 (the GLS 
deviance) in brackets [ ]. 



PARAMETER ESTIMATES (ST. ERRORS) 



Adjustment Deviance Regression 

for APACHE PROCESS a 2 [GLS deviance] parameters 



HOSP 






0.0960 
(0.1610) 


1019.27 
[1019.66] 


1 


HOSP, STRAT 






0.0711 
(0.1579) 


1014.63 
[1014.85] 


6 


HOSP, PROC 




-0.1424 
(0.0840) 


0.1099 
(0.1622) 


1016.31 
[1016.82] 


2 


HOSP, SEVER 


0.955C 
(0.1074) 




0.0036 
(0.2123) 


692.82 
[692.82] 


12 


HOSP, SEVER 
PROC 


0.988,3 
(0.1100) 


-0.1540 
(0.1013) 


0.0092 
(0.2130) 


690.50 
[690.51] 


13 


HOSP, SEVER 
PROC, STRAT 


0.9841 
(0.1 107) 


-0.1 G2G 
(0.1030) 


-0.0093 
(0.2104) 


687.57 
[687.57] 
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Table 5: Logistic regression for death rates (pneumonia), with adjustment for severity, stratification and 

PROCESS, 1985/86. 

The same notation is used as in Table 4. 



PARAMETER ESTIMATES (ST. ERRORS) 


Adjustment 
for 


APACHE 


TROCESS 




Deviance 
[GLS deviance] 


Regression 
parameters 


IIOSP 






-0.0030 
(0.1240) 


1208.60 
[1208.60] 


1 


HOSP, STRAT 






-0.0289 
(0.1207) 


1204.81 
[1204.86] 


6 


IIOSP, PROC 




-0.2721 
(0.0762) 


0.0172 
(0.1263) 


1195.78 
[1195.80] 


2 


IIOSP, SEVER 


0.8032 
(0.0937) 




-0.1053 
(0.1461) 


915.31 
[915.77] 


12 


IIOSP, SEVER 
PROC 


0.8459 
(0.09G0) 


-0.1994 
(0.0884) 


-0.1172 
(0.1456) 


910.29 
[910.84] 


13 


IIOSP, SEVER 
PROC, STRAT 


0.8521 
(0.0963) 


-0.1876 
(0.0892) 


-0.1542 
(0.1498) 


906.61 
[907.53] 
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Approximations to the deviance. Interviewer variability example 
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STANDARD DEVIATION 



Figure 1: Profiles of the approximations to the -2 log-likelihood for the Interviewer 
variability data as a function of the standard deviation a. 

The methods of approximation are: : AML, approximate log-likelihood (16); 

ML3; ML5; ML7; ML9 (MLfc stands for the fc-point 

Gaussian quadrature approximation to the -2 log-likelihood). 



Approximations to the deviance. Interviewer variability example 
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Figure 2: Profiles of the approximations to the -2 log-likelihood for the Interviewer 
variability data as a function of the variance a 2 . 

The notation and methods of approximation are the same as in Figure 1. 



l>0 



Bias of the estimators of the slope, U(-1 ,1) 
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Figure 3: Bias of the estimators of the slope as a function of the standard deviation 
a in the simulated data, U(-l,l) design. 

The estimators are: GLS (generalized least squares); A ML (ap- 
proximate maximum likelihood); RAML (approximate restricted maxi- 
mum likelihood); ML9 (maximum likelihood using 9-point quadrature); 

REML9 (restricted maximum likelihood using 9-point quadrature. The 

scale of the vertical axis is in -.001 (most of the recorded biases are negative). 



Bias of the estimators of the slope, U(1 ,3) 
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Figure 4: Bias of the estimators of the slope as a function of the standard deviation 
a in the simulated data, U(l,3) design. 

The estimators are GLS, AML, RAML, ML9 and REML9, as in Figure 3. 



Estimators of the variance, U(-l,1) design 
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Figure 5: Comparison of the estimators of the variance a 2 ; U(-l,l) design. 

The estimators are: AML (approximate maximum likelihood); 

RAML (approximate restricted maximum likelihood); ML9 (max- 
imum likelihood using 9-point quadrature); REML9 (restricted maximum 

likelihood using 9-point quadrature); denotes the exact value of the variance. 
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Estimators of the variance, U(1 ,3) design 
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Figure 6: Comparison of the estimators' of the variance a 2 ] U(l,3) design. 
The notation is the same as in Figure 5. 



Observed MSE for the estimators of the slope, U(-1 ,1) 
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Figure 7: Mean squared error of the estimators of the slope as a function of the 
standard deviation a in the simulated data, U(-l,l) design. 

The estimators are: GLS; AML; RAML; ML9; 

REML9. The units on the vertical axis are 10"" 3 . 



Observed MSE for the estimators of the slope, U(1 ,3) 
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Figure 8: Mean squared error of the estimators of the slope as a function of the 
standard deviation a in the simulated data, U(l,3) design. 
The notation is the same as in Figure 7. 



Positive AML estimates, var = 0, U(-1 ,1 ) design 
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Estimates of standard deviation 



Figure 9: Distribution of the positive estimates of a for the parameter value a = 0 
in simulated data, U(-l,l) design. 

In the 100 simulations 48 positive estimates of the variance were obtained. 
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Estimated standard errors for the variance, U(-1 ,1 ) 
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Figure 10: Estimated standard errors for the variance estimators for the simulated 
data, U(-l,l) design. 

The methods of estimation are: AML; RAML; ML9; 

REML9. The units on the vertical axis are 10~ 2 . 



Estimated standard errors for the variance, U(1 ,3) 
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Figure 11: Estimated standard errors for the variance estimators for the simulated 
data, U(l,3) design. 

The notation is the same as in Figure 10. 



Approximate likelihood ratio statistic, U(-1,1) design 




THEORETICAL QUANTllES 



Figure 12: The qq-plot of the empirical null-distribution of the likelihood ratio 
statistic, against the xl distribution, for the hypothesis of within-cluster indepen- 
dence (a 7 = 0); the simulated data, U(-l,l) design. 
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